{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Stochastic Gradient Descent (SGD) \n",
    "\n",
    "Stochastic gradient descent is an iterative algorithm that optimizes an objective function by using samples from the dataset. cuML's implementation is mini-batch SGD (MBSGD), which is not implemented by Scikit-learn."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The model can take array-like objects, either in host as NumPy arrays or in device (as Numba or cuda_array_interface-compliant), as well as cuDF DataFrames as the input. \n",
    "\n",
    "For information about cuDF, refer to the documentation: https://docs.rapids.ai/api/cudf/stable\n",
    "\n",
    "For information about cuML's mini-batch SGD implementation: https://rapidsai.github.io/projects/cuml/en/stable/api.html#stochastic-gradient-descent"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "import pandas as pd\n",
    "import cudf as gd\n",
    "\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "from sklearn.datasets import make_regression\n",
    "from sklearn.metrics import mean_squared_error\n",
    "\n",
    "from cuml.linear_model import MBSGDRegressor as cumlSGD\n",
    "from sklearn.linear_model import SGDRegressor as skSGD"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define Parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_samples = 2**20\n",
    "n_features = 399\n",
    "\n",
    "learning_rate = 'adaptive'\n",
    "penalty = 'elasticnet'\n",
    "loss = 'squared_loss'\n",
    "max_iter = 500"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Generate Data\n",
    "\n",
    "### Host"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "X,y = make_regression(n_samples=n_samples, n_features=n_features, random_state=0)\n",
    "\n",
    "X = pd.DataFrame(X)\n",
    "y = pd.Series(y)\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### GPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "X_cudf = gd.DataFrame.from_pandas(X_train)\n",
    "X_cudf_test = gd.DataFrame.from_pandas(X_test)\n",
    "\n",
    "y_cudf = gd.Series(y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scikit-learn Model\n",
    "\n",
    "### Fit "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "sgd_sk = skSGD(learning_rate=learning_rate, \n",
    "               eta0=0.07,\n",
    "               max_iter=max_iter,\n",
    "               tol=0.001,\n",
    "               fit_intercept=True,\n",
    "               penalty=penalty,\n",
    "               loss=loss)\n",
    "\n",
    "sgd_sk.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predict"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "y_sk = sgd_sk.predict(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "error_sk = mean_squared_error(y_test,y_sk)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## cuML Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Fit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "sgd_cuml = cumlSGD(learning_rate=learning_rate, \n",
    "                   eta0=0.07, \n",
    "                   epochs=max_iter,\n",
    "                   batch_size=512,\n",
    "                   tol=0.001, \n",
    "                   penalty=penalty, \n",
    "                   loss=loss)\n",
    "\n",
    "sgd_cuml.fit(X_cudf, y_cudf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Predict"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "y_cuml = sgd_cuml.predict(X_cudf_test).to_array().ravel()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Evaluate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "error_cuml = mean_squared_error(y_test,y_cuml)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compare Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"SKL MSE(y): %s\" % error_sk)\n",
    "print(\"CUML MSE(y): %s\" % error_cuml)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
